IRIT at TREC Temporal Summarization 2014

Authors

  • Rafik Abbes
  • Karen Pinel-Sauvagnat
  • Nathalie Hernandez
  • Mohand Boughanem
Abstract

This paper describes the participation of the IRIT lab in the 2014 TREC Temporal Summarization track. The goal of the Temporal Summarization track is to develop systems that allow users to efficiently monitor information about events over time. Our proposed method selects the documents that are most likely to concern the event, and extracts relevant and novel sentences using a set of filters. The results obtained are presented and discussed.

1 Presentation of the task

The aim of the Temporal Summarization (TS) track is to develop systems that allow users to efficiently monitor information about events. This year, the track ran only one task, which requires systems to iterate over a stream corpus in chronological order and to filter sentences that are relevant and novel with respect to a developing event. A specially filtered subset of the full TREC 2014 StreamCorpus was provided. It consists of about 20 million documents from several sources (News, Social, Forum, Blog, etc.), with a compressed size of 559 GB. Each document is identified by a stream id that consists of two dash-separated parts: a timestamp and a doc id.

This year, 15 topics were evaluated. Each topic represents an event characterized by a title, a Wikipedia URL, a period, a query and a type (accident, storm, bombing, riot, protest, impact event, shooting). For each event, a system should emit a set of timestamped sentences, called updates, to generate the event summary. The ground truth, called nuggets, corresponds to a set of sentences extracted from Wikipedia by the track annotators. Matching updates to nuggets was done by track assessors. A nugget and an update are matched if they refer to the same information.

To evaluate system effectiveness, the track organizers define two metrics: the Expected Latency Gain (ELG) and the Latency Comprehensiveness (LC), which are similar to the traditional IR notions of Precision and Recall, respectively. Systems are ranked based on the harmonic mean of ELG and LC.
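Two conventions from the task description can be illustrated with a short Python sketch: splitting a stream id into its two dash-separated parts, and combining ELG and LC with a harmonic mean. The function names and the example id are hypothetical, and treating the timestamp part as an integer is an assumption made here for illustration only.

```python
from typing import Tuple


def parse_stream_id(stream_id: str) -> Tuple[int, str]:
    """Split a stream id of the form '<timestamp>-<doc id>'.

    Assumption: the timestamp part is an integer. Only the first dash is
    used as the separator, so a doc id containing dashes stays intact.
    """
    timestamp, doc_id = stream_id.split("-", 1)
    return int(timestamp), doc_id


def harmonic_mean(elg: float, lc: float) -> float:
    """Harmonic mean of ELG and LC, returning 0.0 when both are zero."""
    if elg + lc == 0:
        return 0.0
    return 2 * elg * lc / (elg + lc)
```

Like the F-measure built from Precision and Recall, the harmonic mean stays low unless both ELG and LC are reasonably high, so a system cannot rank well by optimizing only one of them.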
2 IRIT method for temporal summarization

Our system is based on Algorithm 1, which is similar to the one given in the track guidelines and used as a reference in some methods at TREC TS 2013 [1, 2]. Given an event query Qe and its corresponding period (start time ts, end time te), our system iterates over the stream corpus in chronological order, hour by hour (line 2 of the algorithm). We can distinguish three basic steps. The first one, done just once, builds a generic event model containing a bag of weighted terms related to events (line 1 of the algorithm). Steps 2 and 3 are repeated for each hour. In step 2, our system decides which documents should be kept in order to extract updates (line 3 of the algorithm), and in the last step, it attempts to detect relevant and novel sentences related to the event (line 6 of the algorithm). These steps are detailed below.

1 http://s3.amazonaws.com/aws-publicdatasets/trec/ts/index.html
2 http://www.trec-ts.org/documents

Algorithm 1 Temporal Summarization algorithm
    Input: C : Time-ordered corpus
    Input: Qe : Event query terms
    Input: ts : Event start time
    Input: te : Event end time
    Input: Ntrain : Set of training nuggets
    Output: U ← {}
     1: θE ← BuildGenericEventModel(Ntrain)
     2: for h ∈ [toHour(ts), toHour(te)] do
     3:     Dh ← getRelevantDocuments(h, Qe, topHits)
     4:     for d ∈ Dh do
     5:         for s ∈ d do
     6:             if isUsefulSentence(s, U) then
     7:                 U.append(u)
     8:             end if
     9:         end for
    10:     end for
    11: end for

2.1 Generic event model

We hypothesize that updates related to events tend to contain a specific vocabulary of terms that is independent of the event type (storm, hurricane, bombing, etc.), such as victims, injuries, deaths, emergency, etc. We call these terms keywords. We assume that we can build a generic event model by leveraging a set of nuggets related to a sample of events.
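The hour-by-hour control flow of Algorithm 1 (lines 2 to 11) can be sketched in Python as follows. This is only an illustration under stated assumptions: timestamps are taken to be integer seconds, a document is represented as a list of sentence strings, and the two callables stand in for getRelevantDocuments and isUsefulSentence, whose actual behavior is described in the paper rather than implemented here.

```python
from typing import Callable, Iterable, List

SECONDS_PER_HOUR = 3600  # assumption: timestamps are expressed in seconds


def to_hour(timestamp: int) -> int:
    """Map a timestamp to the index of the hour that contains it."""
    return timestamp // SECONDS_PER_HOUR


def summarize(
    ts: int,
    te: int,
    get_relevant_documents: Callable[[int], Iterable[List[str]]],
    is_useful_sentence: Callable[[str, List[str]], bool],
) -> List[str]:
    """Walk the event period hour by hour and collect useful sentences."""
    updates: List[str] = []
    for hour in range(to_hour(ts), to_hour(te) + 1):
        for document in get_relevant_documents(hour):
            for sentence in document:
                if is_useful_sentence(sentence, updates):
                    updates.append(sentence)
    return updates


# Toy usage: a hypothetical in-memory "stream" keyed by hour index, with
# exact-match novelty standing in for the real relevance and novelty filters.
toy_stream = {
    0: [["A storm hit.", "A storm hit."]],
    1: [["Two injured.", "A storm hit."]],
}
toy_updates = summarize(
    ts=0,
    te=SECONDS_PER_HOUR + 10,
    get_relevant_documents=lambda hour: toy_stream.get(hour, []),
    is_useful_sentence=lambda sentence, seen: sentence not in seen,
)
```

Passing the two steps as callables keeps the loop itself independent of how documents are ranked and how sentences are judged.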
Specifically, considering a set of training nuggets Ntrain related to m events, we estimate the generic event model θE composed of terms t as follows:

$$P(t \mid \theta_E) = TF(t, N_{train}) \cdot \log\left(\frac{m}{EF(t)}\right)$$

where TF(t, Ntrain) is the term frequency of term t in the training nuggets Ntrain, and EF(t) is the number of events containing term t in at least one nugget. Thus, a term is more weighted when it appears in most of the training events.

2.2 Document selection

To reject documents that are likely to be irrelevant, we apply the following filters:
– Source filter: Based on an analysis of last year's results (TS 2013), we noticed that 95% of relevant documents (i.e., documents having at least one relevant sentence) come from one of the following sources: WEBLOG, MAINSTREAM NEWS and news. For this reason, we reject all documents coming from other sources.
– Title filter: 95% of relevant documents in TS 2013 have a title. We therefore reject documents without titles.
– Duplication filter: We also reject duplicate documents having the same doc id.

In each hour, we keep only the topHits filtered documents based on the following score:
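To make Sections 2.1 and 2.2 more concrete, here is a minimal Python sketch of the term weighting and of the three document filters. It assumes the weighting formula reads as TF(t, Ntrain) multiplied by log(m / EF(t)), uses lowercase whitespace tokenization, and represents documents as dictionaries with hypothetical keys "doc_id", "source" and "title". The source labels are copied from the text and may be spelled differently in the corpus. The per-hour ranking score is not part of this sketch.

```python
import math
from collections import Counter
from typing import Dict, List

# Source labels as written in the text (assumption: exact corpus spelling may differ).
ALLOWED_SOURCES = {"WEBLOG", "MAINSTREAM NEWS", "news"}


def build_generic_event_model(nuggets_by_event: Dict[str, List[str]]) -> Dict[str, float]:
    """Weight each term by TF(t, Ntrain) * log(m / EF(t)).

    Under this reading of the formula, a term that occurs in every
    training event gets weight 0, because log(m / m) = 0.
    """
    m = len(nuggets_by_event)
    tf: Counter = Counter()  # occurrences of each term over all nuggets
    ef: Counter = Counter()  # number of events whose nuggets contain the term
    for nuggets in nuggets_by_event.values():
        event_terms = set()
        for nugget in nuggets:
            terms = nugget.lower().split()
            tf.update(terms)
            event_terms.update(terms)
        ef.update(event_terms)
    return {term: tf[term] * math.log(m / ef[term]) for term in tf}


def filter_documents(documents: List[dict]) -> List[dict]:
    """Apply the source, title and duplication filters, preserving order."""
    seen_ids = set()
    kept = []
    for doc in documents:
        if doc["source"] not in ALLOWED_SOURCES:
            continue  # source filter
        if not doc.get("title"):
            continue  # title filter
        if doc["doc_id"] in seen_ids:
            continue  # duplication filter
        seen_ids.add(doc["doc_id"])
        kept.append(doc)
    return kept
```

Collecting the terms of each event in a set before updating EF ensures that a term repeated across several nuggets of the same event is counted once for that event.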

Similar resources

IRIT at TREC Real Time Summarization 2016

This paper presents the participation of the IRIT laboratory (University of Toulouse) in the Real Time Summarization track of TREC 2016. This track consists of filtering the tweet stream in real time and identifying both relevant and novel tweets to be pushed to the user. Our team proposes three different approaches: (1) The first approach consists of a filtering model that combines seve...

ISCASIR at TREC 2015 Temporal Summarization Track

The goal of the Temporal Summarization task is to develop systems which can detect useful, new, and timely sentence-length updates about a developing event. This paper describes our participation in the Temporal Summarization track of TREC 2015. Based on the word embedding technique, we submitted two runs for the summarization task. The query expansion technique is used for the first run and relevant se...

IRIT at TREC Real-Time Summarization 2017

This paper presents the participation of the IRIT laboratory (University of Toulouse) in the Real-Time Summarization track of TREC RTS 2017. This track aims at exploring prospective information needs over document streams containing novel and evolving information, and it consists of two scenarios (A: push notification and B: email digest). In this year the live mobile assessment was made availa...

Summarizing tweet in real-time by filtering quality, relevant and non redundant tweets

This paper presents the participation of the LIRMM laboratory (University of Montpellier), P3 Group and the IRIT laboratory (University of Toulouse) in the Real Time Summarization track of TREC 2017. We extend our previous approach [1] for real-time filtering of the tweet stream, which aims to identify quality, relevant and non-redundant tweets to be pushed to the user in real time. We describe in this paper...

ZZISTI at TREC2013 Temporal Summarization Track

Our team submitted runs for the first running of the TREC Temporal Summarization track. The TS track at TREC 2013 contains two tasks, namely Sequential Update Summarization and Value Tracking. Our systems for each task are described in this paper. In particular, Stanford CoreNLP was applied to extract the event attributes.


Publication date: 2014